Quant Methods Flashcards
Dependent and independent variables // Graph function
Yi = b0 + b1Xi + εi
- Dependent variable is Yi
- Independent variable is Xi
- Error term is εi
- Coefficients are b0 (intercept) and b1 (slope coefficient)
Scatter plot types
Correlation coefficient (ρ or r) (Formula)
Correlation standardizes covariance by dividing it by the product of the standard deviations
Perfect positive correlation: +1
Perfect negative correlation: -1
No correlation: 0
Covariance (Formula)
A statistical measure of the degree to which two variables move together
(Sample) Standard Deviation Formula
Sx = [ Σ(xi - x̄)² / (n - 1) ]^(1/2)
Easier with calculator!!
Using calculator for Data Series to get Sx, Sy, r
- Add Data Series: [2nd] + [7]
- View Stats / Results: [2nd] + [8] > LIN [Down arrow]
Does not calculate Covariance!
BUT
Cov = rxy * Sx * Sy
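A minimal numpy sketch (illustrative numbers, not exam data) tying these cards together: sample standard deviations, correlation, and the identity Cov = r * Sx * Sy:

```python
# Illustrative only; the data values are made up.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

sx = x.std(ddof=1)                    # sample standard deviation of x
sy = y.std(ddof=1)                    # sample standard deviation of y
cov_xy = np.cov(x, y, ddof=1)[0, 1]   # sample covariance

r = cov_xy / (sx * sy)                # correlation = covariance standardized
print(sx, sy, r)
print(cov_xy, r * sx * sy)            # Cov = r * Sx * Sy (same number)
```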
Limitations of correlation analysis
- Correlation coefficient assumes a linear relationship (not parabolic, etc.)
- Presence of outliers can be distortive
- Spurious correlation
- Correlation does not imply causation (rain in NYC has no effect on London bus routes, although there might be a statistical correlation)
- Correlations without sound basis are suspect
Assumptions underlying simple linear regression
- Linear relationship – might need transformation to make linear
- Independent variable is not random – assume expected values of independent variable are correct
- Expected value of error term is zero
- Variance of error term is same across all observations (homoskedasticity)
- Error terms uncorrelated (no serial/auto correlation) across observations
- Error terms normally distributed
Standard error of the estimate (SEE)
Standard error of the distribution of the errors about the regression line
The smaller the SEE, the better the fit of the estimated regression line; the tighter the points sit around the line
k = # of independent variables (single regression: 1)
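The usual formula, consistent with k as defined above (n - 2 degrees of freedom in simple regression):

```latex
SEE = \sqrt{\frac{SSE}{n-k-1}} = \sqrt{\frac{\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2}{n-k-1}}
```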
Sum of squared errors (SSE)
UNEXPLAINED: Actual (yi) - Predicted (ŷi)
The estimated regression equation will not predict the values of y exactly; it only estimates them
A measure of this error is the SSE (ŷ is the predicted value)
The coefficient of determination (R2)
Describes the percentage variation in the dependent variable explained by movements in the independent variable
Just r2 (the +/- sign is lost); add the sign back when recovering r
R2 = 80% = 0.8 > r = 0.8^(1/2) = 0.89; the slope is negative, so r = -0.89 (see below)
ŷ (predicted) = 0.4 - 0.3x > b1 = -0.3
Alternatively: R2 = RSS / TSS (if RSS = TSS, R2 = 1 > perfect fit)
R2 = 1 - SSE / TSS (if SSE = 0, R2 = 1 > perfect fit)
Total sum of the squares (TSS)
ACTUAL (yi) - MEAN
Alternatively, TSS = RSS +SSE
Regression sum of the squares (RSS)
EXPLAINED: Prediction (ŷ) - MEAN
Difference between the estimated values for y and the mean value of y
Graphic: Relationship between TSS, RSS and SSE
Relationship between TSS, RSS and SSE
- Using SSE, TSS and RSS to measure the goodness of fit of the estimated regression equation
- The estimated regression equation would be a perfect fit if every value of the dependent variable yi happened to lie on the estimated regression line. This would result in SSE=0 and RSS=TSS
- RSS/TSS is known as the coefficient of determination and is denoted by R2 :
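A short numpy sketch (made-up data) showing the decomposition and that RSS/TSS and 1 - SSE/TSS give the same R² for an OLS fit:

```python
# Illustrative only; data values are made up.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 5.0, 7.0, 6.0, 9.0])

b1, b0 = np.polyfit(x, y, 1)        # OLS slope and intercept
y_hat = b0 + b1 * x
y_bar = y.mean()

sse = ((y - y_hat) ** 2).sum()      # unexplained: actual - prediction
rss = ((y_hat - y_bar) ** 2).sum()  # explained: prediction - mean
tss = ((y - y_bar) ** 2).sum()      # total: actual - mean

print(np.isclose(tss, rss + sse))   # TSS = RSS + SSE holds for an OLS fit
print(rss / tss, 1 - sse / tss)     # both equal R^2
```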
Hypothesis testing on regression parameters
- Confidence Interval on b0 and b1
- For a 90% confidence interval, 10% significance, 5% (a/2) in each tail
- More HT in Multiple Regressions
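In symbols, a confidence interval on the slope takes the standard form (simple regression, n - 2 degrees of freedom):

```latex
\hat{b}_1 \pm t_{\alpha/2,\;n-2}\cdot s_{\hat{b}_1}
```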
ANOVA tables
- ANOVA stands for ANalysis Of VAriance
- It is a summary table produced by statistical software such as Excel
- Using the ANOVA table, calculate the coefficient of determination
- The global test for the significance of the slope coefficient
- Use of the F-statistic
Prediction intervals on the dependent variable
- Range of dependent variable (Y) values for a given value of the independent variable (X) and a given level of probability
- Two sources of error: Regression line and SEE
e.g. a prediction interval such as 20 to 40
Limitations of regression analysis
- Parameter instability - Regression relationships can change over time
- Public knowledge of relationships - If a number of analysts identify a regression relationship that works, prices will change to reflect the inflow of funds, possibly removing the trading opportunity
- Assumption violation - If regression assumptions are violated then hypothesis test and predictions will be invalid
Multiple Regression
Assumptions
- The relationship between the dependent variable and each independent variable is linear
- The independent variables are not random and there is no multicollinearity (x:x)
- The expected value of the error term is zero
- Error term is homoskedastic (error variance is constant; the same scatter across observations)
- No serial correlation
- Error term is normally distributed
ANOVA
Work out:
- Degrees of freedom (DF) with k = # variables ; n = sample size
- Sum of squares: 2 will be given (TSS = RSS + SSE)
Using the regression equation to estimate the value
Becomes: Ŷ = 0.163 - (0.28 x 11) + (1.15 x 18) + (0.09 x 215) = 37.13
But this is only an estimate, we will want to apply confidence intervals to this
Individual test: T-test
Testing the significance of each of the individual regression coefficients and the
intercept
Tcalc = bi / S.E.(bi)
Tcrit: 2 (given in CFA)
|TCalc| > TCrit = REJECT NULL (H0: bi = 0)
then bi not equal to 0 = SIGNIFICANT
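In symbols (degrees of freedom n - k - 1 in multiple regression):

```latex
t_{calc} = \frac{\hat{b}_i - 0}{s_{\hat{b}_i}}, \qquad df = n - k - 1
```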
Global F-Test: Testing the validity of the whole regression
Testing to see whether or not all of the regression coefficients as a group are insignificant
FCalc > FCrit = REJECT NULL: at least one coefficient does not equal zero (one-tailed test)
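The F-statistic is built from the ANOVA quantities (df = k and n - k - 1):

```latex
F_{calc} = \frac{MSR}{MSE} = \frac{RSS/k}{SSE/(n-k-1)}
```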
T-Test: Specified Value
Determining whether a regression coefficient is significantly different from a specified value e.g. 1
Tcalc = (bi - 1) / S.E.(bi)
Tcrit: 2 (given in CFA)
|TCalc| > TCrit = REJECT NULL (H0: bi = 1)
then bi not equal to 1 = significantly different from the specified value
R2 Recap
“The percentage of the total variation in the dependent variable (Y) that is explained by the regression equation”
Adjusted R2
- The problem with R2 is that it will automatically increase if new independent variables are added, even if the new variable adds very little to the regression
- Adjusted R2 takes into account the number of independent variables
- It will only increase if the new independent variable pulls its weight
Example: adding a 4th variable increases R2 (which looks good), but Adjusted R2 decreases, which is worse - the new variable does not pull its weight, so prefer the model without it (judge on Adjusted R2, not R2).
Interpret rather than use formula.
Dummy variables in regression analysis
- Qualitative variables are important - E.g. investor confidence
- Incorporate by dummy variables - Assigned either “1” or “0”
- If you want to describe j circumstances with dummy variables you need j-1 dummy variables - E.g. month of year effect requires 11 dummy variables
Write a suitable regression equation and test significance (t-test: if Tcalc = b1 / S.E. > Tcrit, REJECT = significant); see the example equation below
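A hypothetical month-of-the-year example (11 dummies, January as the omitted base case):

```latex
R_t = b_0 + b_1\,Feb_t + b_2\,Mar_t + \dots + b_{11}\,Dec_t + \varepsilon_t
```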
Homoskedasticity
Variance of the error terms is constant across all of the observed data
Heteroskedasticity
Variance of the error terms is not constant across all of the observed data
Testing for conditional heteroskedasticity: Breusch-Pagan test
Breusch-Pagan test
Testing for conditional heteroskedasticity
- Regress the squared errors against each independent variable
- Determine R2 of these regressions
- If no conditional heteroskedasticity there will not be a strong relationship
- If a high R2 there may be a strong relationship
- But also need to consider the number of observations
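A hand-rolled sketch of the idea (numpy/scipy assumed, synthetic data): regress the squared residuals on the independent variable and use the n × R² statistic, compared against a chi-square distribution with k degrees of freedom:

```python
# Sketch only; residuals are simulated so that their variance depends on x.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 100
x = rng.normal(size=n)
resid = rng.normal(scale=1 + 0.5 * np.abs(x))     # heteroskedastic residuals

# Auxiliary regression: squared residuals on x (with intercept)
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, resid ** 2, rcond=None)
fitted = X @ beta
y_aux = resid ** 2
r2 = 1 - ((y_aux - fitted) ** 2).sum() / ((y_aux - y_aux.mean()) ** 2).sum()

bp_stat = n * r2                                  # Breusch-Pagan test statistic
p_value = 1 - stats.chi2.cdf(bp_stat, df=1)       # k = 1 independent variable
print(bp_stat, p_value)                           # small p-value -> conditional heteroskedasticity
```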
Correcting for heteroskedasticity
How we would correct for conditional heteroskedasticity:
- Compute robust standard errors
- Modify the regression equation by using generalized least squares method
Robust standard errors correct Tcalc
Autocorrelation / Serial correlation
E:E
- The residuals of a regression are correlated across observations, so that a positive (or negative) error in one observation affects the probability that there will be a positive (or negative) error in the next observation (previous error predicts the next error; E:E)
- Effect is that standard errors may be incorrect
- Thus we may incorrectly reject/fail to reject null hypotheses about the population values
- If one or more of the independent variables is a lagged value of the dependent variable, then serial correlation causes all regression parameters to be invalid – very serious problem as you may be performing the wrong type of regression
- Detect with Durbin-Watson statistic
Durbin-Watson statistic
Detect autocorrelation
DW = 2 * (1 - r)
- Obtain the critical value of the DW statistic (given in exam)
- Testing for positive autocorrelation:
H0: No positive autocorrelation
- If DWcalc < dl, reject H0
- If DWcalc > du, do not reject H0
- If dl ≤ DWcalc ≤ du, the test is inconclusive
Example:
- DW Statistic = 1.87
- Assume the lower and upper critical values are 1.61 and 1.74
=> DWcalc (1.87) > du (1.74) => do not reject = No positive autocorrelation
=> if DWcalc was 1.65 => Inconclusive
=> if DWcalc was 1.00 => smaller than dl => REJECT => +ve Autocorrelation
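A small numpy sketch (hypothetical residuals) computing the exact DW statistic alongside the 2(1 - r) approximation on the card:

```python
# Illustrative only; residuals are made up.
import numpy as np

e = np.array([0.5, 0.3, 0.4, -0.1, -0.3, -0.2, 0.1, 0.4])  # regression residuals

dw = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)   # sum (e_t - e_{t-1})^2 / sum e_t^2
r = np.corrcoef(e[1:], e[:-1])[0, 1]            # lag-1 autocorrelation of residuals
print(dw, 2 * (1 - r))                          # approximately equal for long series
```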
Correcting for serial correlation
- Hansen method of adjusting the standard errors of the regression coefficients upwards
- Change the regression equation so that the autocorrelation is eliminated (do something different!!!!)
Hansen adjusts for both serial correlation and heteroskedasticity. It does not eliminate serial correlation.
Multicollinearity
(X:X)
Definition
- Multicollinearity occurs when two or more independent variables (or combinations of independent variables) in a regression model are highly (but not perfectly) correlated with each other (x:x)
- Estimates of regression coefficients will be unreliable
- Cannot distinguish individual impacts of independent variables
Detection of multicollinearity
- High R2 (the regression as a whole predicts movement in y well)
- Significant F-stat (at least one bi is significant)
- but low t-stats on each regression coefficient (due to overstated standard errors) - individually not significant: evidence of multicollinearity
- Can also be tested with a pairwise correlation matrix, but only when there are two independent variables (if their correlation is close to +/- 1 they are multicollinear)
Correcting for multicollinearity
- Reformulate the regression model, leaving out variables that appear to be redundant
- Rerun the regression model
- In practice it can be difficult to determine which variables to exclude so experimentation may be necessary
Summary: violation of assumptions
Principles of model specification
- Model should be grounded in sensible economic reasoning - E.g. avoid data mining
- Functional form of variables should be appropriate - E.g. use logs of inputs if appropriate
- Model should be parsimonious i.e. achieving a lot with a little
- Model should be examined for violations of regression assumptions before being accepted
- Model should be tested ‘out of sample’, i.e. use new sample data before being accepted
The model could fail because:
- One or more important variables are omitted (forget to put a variable in)
- One or more of the regression variables may need to be transformed - E.g. using natural logs for exponential data (or rescaling from millions to thousands)
- Data from different samples is pooled, e.g. using data from different stages of a company’s growth (mixing relationships)
Models with qualitative dependent variables
NOT dummy (independent)
Qualitative dependent variables are where dummy variables are used as dependent rather than independent variables
There are three main models:
- Probit model - Estimates the probability of a discrete outcome (e.g. that a company will go bankrupt); uses the normal distribution
- Logit model - Based on the ‘logistic distribution’, a simplified version of the normal distribution that was useful before computers were developed
- Discriminant analysis - Yields a linear function, similar to a regression equation, that creates an overall ‘score’ for the dependent variable based on the values of the independent variables. If the score is above a certain number, the dependent variable is assigned a value of ‘1’; otherwise, it is assigned a value of ‘0’
Qualitative dependent output!!
Time-series // Time-series analysis
A time series is a set of observations on a variable’s outcomes in different time periods
Models to use: Trend model (linear / log-linear) & Autoregressive (AR)
Key issues:
- How do we predict a future value based on past values?
- How do we model seasonality?
- How do we choose which models to use?
- How do we model changes in the variance of the time series over time?
Linear trend models
Serial correlation is likely; use DW to spot it
Limitations of trend models
- Residuals are often serially correlated, which tends to bias standard errors of regression coefficients downward (E:E; if overstated last period, likely overstated again this period)
- This violates regression assumptions
Testing for serial correlation
- Durbin-Watson test (see overleaf for a reminder of the DW test)
- Plot a graph of Y against time, superimpose the estimated linear trend line and judge the fit by eye
Log-linear trend models
A trend model in which the logarithm of the dependent variable (lnYt ) is linearly related to time
Autoregressive time series models
If a trend model has unacceptably high serial correlation in its residuals, an autoregressive time series model may solve the problem
An autoregressive time-series model is one in which the value of a time series in one period (xt) is related to its value in previous periods (xt-1, xt-2, and so forth).
Valid statistical inferences can be made from autoregressive time-series models only if the time series is covariance stationary
Covariance stationary
In essence that its mean and variance do not change over time
To be covariance stationary, a time series must satisfy three requirements:
- The mean of the time series must be constant and finite in all periods
- The variance of the time series must be constant and finite in all periods
- The covariance of the time series with itself for a fixed number of periods in the past or future must be constant and finite in all periods.
If a time series is not covariance stationary we cannot model it using an AR model. If time series is not covariance stationary we may be able to transform to one
Standard Error Formula
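Assuming this card refers to the standard error used for the residual-autocorrelation t-tests in the next card, the usual formula is one over the square root of the number of observations:

```latex
s_{\hat{\rho}_k} = \frac{1}{\sqrt{T}}
```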
Testing for serial correlation in an autoregressive time series model
- CANNOT use Durbin-Watson statistic
- T-test on the regression coefficients and on the autocorrelations of the residuals (each autocorrelation divided by its standard error)
- The regression coefficient is statistically significant at 5% because the t-stats are larger than 2.0 (IMPORTANT: here significant is good, as Xt-1 is explaining Xt)
- Autocorrelations of the residuals are all not significantly different from zero due to low t-stats (good news), so no need to re-specify the model
Chain rule
Process of forecasting where uncertainty is added at each forecast period so multi-period forecasts have more uncertainty than single period forecasts
Mean reversion
- A time series exhibits the property of mean reversion if it tends to fall when its level is above its mean-reversion level (MRL) and rise when its level is below its mean-reversion level
- Covariance stationary data will be mean reverting
- b1 = 1 => No finite MRL => Not covariance stationary
- b1 = 1 => Unit root => Random walk (not covariance stationary)
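For an AR(1) model xt = b0 + b1·xt-1 + εt, the mean-reverting level is:

```latex
x_{MRL} = \frac{b_0}{1 - b_1}
```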
Comparing Forecast Model Performance
Out of sample forecasts: we tend to look at out of sample results to compare forecast accuracy of two different models because the future is always out of sample
Typically compare performance using Root Mean Squared Error (RMSE):
RMSE = √[ Σ(actual - forecast)² / n ]
The smaller the better
Also consider coefficient stability!!
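A quick numpy sketch (hypothetical forecasts) comparing two models by out-of-sample RMSE:

```python
# Illustrative only; forecasts and actuals are made up.
import numpy as np

actual  = np.array([2.0, 2.4, 2.1, 2.6, 2.9])
model_a = np.array([2.1, 2.3, 2.3, 2.5, 2.8])
model_b = np.array([1.8, 2.7, 1.9, 2.9, 3.2])

def rmse(y, y_hat):
    return np.sqrt(np.mean((y - y_hat) ** 2))

print(rmse(actual, model_a), rmse(actual, model_b))  # prefer the lower RMSE
```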
Coefficient stability
Regression coefficients are not stable over time
Don’t use data to construct an AR model that crosses periods with very different underlying conditions, need to apply subjective judgment
Simple random walk
A simple random walk is a time series whose value in every period equals its value in the previous period plus an unpredictable random error
Special case of a first-order autoregressive time series model in which b0 is 0 and b1 is 1
- Means that the best forecast of xt is xt-1, because the expected value of the error term is zero
- Note that it is not xt that is random, but the variable xt - xt-1
- Random walks have an undefined mean-reverting level
- Random walk is NOT covariance stationary
Random walk with drift
An autoregressive time series model in which b0 is not 0 and b1 is 1 is a random walk with a drift
Means that the best forecast of xt is b0 + xt-1 ,because expected value of error term is zero
The problem with all random walks is that the data is not covariance stationary
Can convert data to a covariance-stationary time series by first differencing
First-differenced series will have no predictive value but will help us conclude that the original series was a random walk.
First Differencing
Define a new series yt = xt - xt-1
For a random walk this gives yt = b0 + εt (with b0 = 0 if there is no drift)
Even though this does not help us to make predictions, it is nonetheless covariance stationary
Unit Root problem
Unit root = when a lag coefficient is not significantly different to one
Model not covariance stationary (need lag coefficient of less than 1)
If lag coefficient = 1 then we have a random walk. By definition all random walks have unit roots
If lag coefficient >1 then we have an explosive root
Need to transform into covariance stationary form with First Differencing
Test for Unit root using Dickey Fuller test
Dickey-Fuller Test
Test for unit root: DFT to see if b1 - 1 (g) is significantly different to zero
b1 -1 = g
b1 = 1 then g = 0 thus H0 : g = 0
Calculate t stat as usual and compare to a Dickey-Fuller critical stat. - If reject null then do not have unit root problem
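The test regression is obtained by subtracting xt-1 from both sides of the AR(1) model:

```latex
x_t - x_{t-1} = b_0 + g\,x_{t-1} + \varepsilon_t,\qquad g = b_1 - 1,\qquad H_0: g = 0
```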
Seasonality in time-series models
Test Lags with T-test
If TCalc > TCrit = REJECT = Significant
Add a lag: i.e. if the 4th lag is significant (quarterly data), re-specify the model to include a seasonal lag of one year
To test if the new model is correct, retest for seasonality
Once specified correctly can be used for forecasting
ARCH Models
Autoregressive conditional heteroskedasticity models
To test for such a relationship: ARCH Test
- (Regress the squared error terms on the previous period’s squared error terms)
- If the regression coefficient (a1 ) of this ARCH(1) model is statistically significant (T-TEST), the error terms in the model are ARCH(1)
Using ARCH model to predict the variance of the error terms
Use the ARCH equation to predict the variance of error terms in the t+1 period
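In symbols, the ARCH(1) test regression and the resulting variance forecast:

```latex
\hat{\varepsilon}_t^{\,2} = a_0 + a_1\,\hat{\varepsilon}_{t-1}^{\,2} + u_t,\qquad
\hat{\sigma}_{t+1}^{\,2} = \hat{a}_0 + \hat{a}_1\,\hat{\varepsilon}_t^{\,2}
```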
Cointegration
Definition
Example
How to test for it
Two (or more) time series might not be stationary, e.g. have unit root problem, but if we regress the series against each other we might find we have a (covariance) stationary series – this is called cointegration. If the series are cointegrated then the error term above will not have a unit root
Example: Regressing the price of a stock market index and also the associated future contract. Each one individually might exhibit a random walk however we would intuitively expect a stable relationship between them.
Only reliable for modelling where there is a long run, stable relationship.
Testing for cointegration: The (Engle-Granger) Dickey-Fuller test
(Engle-Granger) Dickey-Fuller test
Testing for cointegration
Check whether Error Term has a unit root. If the series are cointegrated then the error term above will not have a unit root.
If we reject the null (g is significantly different from zero) then we conclude the error term is covariance stationary = no unit root = cointegrated
Time-Series Analysis: Determining which model to use
Machine learning defined
and 3 different classes
- Extracting knowledge from large amounts of data (big data)
- Goal of automating the decision-making process by
- ‘Learning’ from known examples to determine an underlying structure in the data
Find the pattern, apply the pattern.
Broadly categorized into three distinct classes of techniques:
- Supervised learning
- Unsupervised learning
- Deep Learning
Supervised machine learning
Requires the use of a labelled data set i.e. matched set of observed inputs and the associated output
The ML algorithm is ‘trained’ using the labeled data set to infer the pattern-based prediction rule between the inputs and output
- The ‘fit’ of the ML model is evaluated using labelled test data where the predicted targets (Y predicted) are compared to the actual targets (Y actual)
- Two categories of problems: 1. Regression problems where the target variable is continuous (even if the ML technique used isn’t regression) 2. Classification problems where the target variable is categorical or ordinal e.g. fraudulent or non-fraudulent transactions
Unsupervised machine learning
Does not make use of labeled data, the ML algorithm seeks to discover structure within the data set
Two types of problems suited towards unsupervised ML: 1. Dimension reduction aims at reducing the number of features used whilst retaining variation across observations e.g. identifying major factors underlying asset price movements 2. Clustering aims on sorting observations into groups (clusters) based on similarity that may or may not be pre-specified (for example, the number of groups) e.g. sorting companies into financial statement data groups
Deep learning and reinforcement learning
complex and sophisticated algorithms tackle highly complex tasks such as image and speech recognition
In reinforcement learning a computer learns through trial and error (e.g. by playing millions of games against itself), maximizing a defined reward
Both are based on artificial neural networks (ANNs), and can be supervised or unsupervised
Overfitting
Represents a model that fits its training data too well (i.e. it has incorporated noise or random fluctuations) and so does not predict well using out-of-sample data
Low or no in-sample error (Ein) but large out-of-sample error (Eout) represents poor generalization / overfit!
Main contributors to overfitting:
- High noise levels
- Too much complexity in the model i.e. features in the model, number of branches, linear or nonlinear relationship
Sources of total out-of-sample error (Eout)
- Bias Error – the degree to which the model fits the training data (associated with underfitting)
- ML models with erroneous assumptions produce high bias and poor approximations, which results in underfitting and high in-sample error
- Variance Error – a measure of how much the model’s results change in response to new data from the validation and test samples
- An unstable model will pick up ‘noise’ and produce high variance, causing overfitting and high out-of-sample error
- Base Error – error due to randomness in the data
ML: Learning curves
High Variance Error = Over-fitting
High Bias Error = Under-fitting
ML: Fitting curve
A fitting curve shows in-sample and out-of-sample error rates (Ein and Eout) against model complexity
Typically:
- Linear functions are more susceptible to bias error and underfitting
- Non-linear functions are more susceptible to variance error and overfitting
An optimal point (managing overfitting risk) of model complexity exists where the bias and variance error curves intersect and where Ein and Eout rates are minimized
ML: Preventing overfitting
- Estimation of an overfitting penalty that increases in size with the number of included features
- Prevents the algorithm from getting too complex during the selection and training process
- Only include parameters that reduce out-of-sample error
- Cross-validation
- A process aimed at reducing sampling bias
- The challenge is to have a large enough data set to partition the data into representative groups for training, validation and testing (holdout sample).
- k-fold cross validation: data (excluding a test sample) is shuffled randomly and split into k equal size sub-groups (typically 5 or 10), with k-1 groups used as training samples and one sample (the kth) used as a validation sample. The process is repeated k times so each data point is used in the training data set k-1 times and in the validation data set once. The average of the k-validation errors (mean Eval) taken as an estimate of the model’s Eout
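A minimal k-fold cross-validation sketch (scikit-learn assumed; data is synthetic and the linear model is only illustrative):

```python
# Sketch only; features, target and model choice are made up.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([0.5, -0.2, 0.1]) + rng.normal(scale=0.1, size=100)

kf = KFold(n_splits=5, shuffle=True, random_state=0)   # k = 5 folds
errors = []
for train_idx, val_idx in kf.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    pred = model.predict(X[val_idx])
    errors.append(mean_squared_error(y[val_idx], pred))

print(np.mean(errors))   # mean validation error, an estimate of out-of-sample error
```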
ML: Penalized regression
Penalized regression is a process of regularisation that helps reduce the effect of ‘overfitting’ a model
- A penalty term is created that increases in size as the number of included variables in the model increases e.g. in Least Absolute Shrinkage and Selection Operator (LASSO):
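The LASSO objective in symbols (λ is the regularization hyperparameter; a larger λ shrinks more coefficients towards zero):

```latex
\min_{b}\;\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2 \;+\; \lambda\sum_{k=1}^{K}\left|\hat{b}_k\right|
```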
ML: Support Vector Machine (SVM)
A very popular ML algorithm used for classification, regression, and outlier detection
SVM is a linear classifier that determines a hyperplane (e.g. a line) that optimally separates the data into two sets of data points
ML: K-Nearest Neighbour (KNN)
- Supervised ML technique used commonly for classification and sometimes for regression
- Aims to classify a new observation by identifying similarities between the new observation and the existing data
- KNN is a straightforward, intuitive, non-parametric technique that can be used in a multiclassification situation
- However, defining the term ‘similar’ can be difficult
- The value of k (a hyperparameter) in the model must be carefully chosen:
- Too small: Results in a high error rate and sensitivity to local outliers
- Too big: Dilutes the concept of nearest neighbor by averaging too many outcomes
- Even: May result in ties and no clear classification
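A toy KNN classification sketch (scikit-learn assumed; data points are made up, k = 3 chosen odd to avoid ties):

```python
# Sketch only; the two-feature data and class labels are invented.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [6.0, 9.0], [1.2, 0.5], [7.0, 9.5]])
y = np.array([0, 0, 1, 1, 0, 1])           # two classes

knn = KNeighborsClassifier(n_neighbors=3)  # k = 3 nearest neighbours
knn.fit(X, y)
print(knn.predict([[1.1, 1.0], [6.5, 9.0]]))  # classify new observations
```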
ML: Classification and Regression Trees (CART)
ML technique used to predict either a:
- Categorical target variable, i.e. a classification problem, producing a classification tree, or
- Continuous outcome, i.e. a regression problem, producing a regression tree
Algorithm produces a visual decision tree with binary branching to classify observations
CART makes no assumptions about the characteristics of the training data - Therefore, if left unconstrained it can be subject to overfitting. This can be mitigated by the introduction of regularization parameters: • Maximum depth of tree • Minimum population at each node • Maximum number of decision nodes
Ensemble learning
Combining the predictions from a collection of models to create an average predicted value
Heterogeneous learners: different types of algorithms combined together with a voting classifier
Homogeneous learners: a combination of the same algorithm using different training data
Random Forest Classifier
An ensemble of many decision trees (homogeneous learners), each trained on a different random subset of the data and features (bagging); the forest classifies by majority vote across the trees
Dimension reduction – Principal Components Analysis (PCA)
Unsupervised ML Algorithms: Process used to summarize or reduce highly correlated features into a few main, uncorrelated composite variables
Clustering algorithms
Clustering groups observations solely on the basis of information found in the data, with no pre-determined labelling. A cluster is created from a sub-set of data that is deemed to be ‘similar’
- Cohesion – observations in each cluster are similar to each other
- Separation – observations in two different clusters are as dissimilar as possible
Uncovers potentially interesting and novel relationships not previously identified using standard classifications to group companies such as industry and sector
Two popular approaches include: • K-means clustering • Hierarchical clustering
K-means clustering
Iterative process of repeatedly partitioning data into a fixed number, k, of nonoverlapping clusters
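A toy k-means sketch (scikit-learn assumed; data points are made up, k fixed at 2):

```python
# Sketch only; the two obvious clusters are invented for illustration.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.1], [0.9, 1.3], [1.2, 0.8],
              [8.0, 8.2], [7.8, 8.5], [8.3, 7.9]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)            # cluster assignment for each observation
print(km.cluster_centers_)   # centroid of each cluster
```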
Hierarchical clustering
An iterative procedure that builds a hierarchy of clusters. The algorithm creates intermediate rounds of clusters that are of:
- Increasing size: Agglomerative – used in large datasets because of its fast computing speed. It makes decisions on local patterns without an initial global structure, therefore, it’s good at identifying smaller clusters
- Decreasing in size: Divisive – starts with an initial global structure and is better suited to identifying large clusters.
Neural networks
Deep Learning Nets (DLNs)
Reinforcement Learning (RL)
ML Summary
Big Data
Definition
Characteristics of Big Data:
- Volume: Data collected in files, tables and datasets is large
- Velocity: The speed at which data is communicated is great! Real-time data is becoming the norm in many areas
- Variety: Data is collected from many different sources and in many different formats: - Structured data such as SQL tables and CSV files - Semi-structured data such as HTML code - Unstructured data such as video data
When using data for inference or prediction, there is a “Fourth V”:
- Veracity: Relates to the credibility and reliability of different data sources e.g. fake news and spam emails • Identifying quality from quantity!
Data Analysis: ML Model Building Summary
Traditional (Structured) ML Model Building Steps
- Conceptualization • Determining what the inputs and output of the model are, e.g. will the stock price rise or fall in a week’s time? • How will the model be used, and who will use it? • How will the model be incorporated into the business’ processes?
- Data Collection • Mostly data collected from internal and external sources in a structured form, e.g. cells with values • External data can be accessed through an application programming interface (API) which allows communication between different software components
- Data Preparation and Wrangling • Cleansing the data to resolve missing values or out-of-range values • Preprocessing the data: Extracting, aggregating, filtering, and selecting relevant data columns
- Data Exploration • Involves exploratory data analysis, feature selection, and feature engineering
- Model Training • Selecting the appropriate ML method(s) • Evaluating the performance of the trained model • Tuning the ML model
Text Based (Unstructured) ML Model Building Steps
- Text Problem Formulation • Identify the inputs and outputs, e.g. identify a sentiment score that is structured output from an unstructured input, like text
- Data (Text) Curation • Gathering external text data via web services or web spidering (scraping or crawling) programs that extract raw content from a source, like web pages
- Text Preparation and Wrangling • Cleaning and preprocessing to convert the unstructured text into a format that can be interpreted by traditional modeling methods designed around structured inputs
- Text Exploration • The process of visualizing the text using techniques such as word clouds • Also, text feature selection and engineering
- Model Training
The output resulting from the process could be combined with other structured variables or used directly for forecasting and analysis. - The detail of steps 3 and 4 varies between structured data and text-based (unstructured) data. We will go on to look at these points in more detail.
Introduction to Data Preparation and Wrangling
Data Preparation (Cleansing)
- The process of examining, identifying, and mitigating errors in raw data
- Common issues include missing, duplicated, erroneous or inaccurate values
- Automated data can have similar issues due to software bugs and server failures
Data Wrangling (Preprocessing)
- Involves the transformation and processing of the cleansed data so that it is ready to be used for ML model training
- The data may be processed to deal with outliers, extraction of useful variables from the existing data, and also scaling the data
Different for Structured data // Unstructured (Text) data
Structured Data: Data Preparation and Wrangling
1. Data Preparation (Cleansing)
Possible errors in a raw dataset (e.g. a table) include:
- Incompleteness error – data is not present, i.e. missing values • Seek alternative sources • Missing values and NAs must be either deleted or replaced with an imputed value (e.g. the mean, median or mode, or assume 0)
- Invalidity error – data is outside of a meaningful range, creating invalid data
- Inaccuracy error – data is not a measure of true value
- Inconsistency error – data conflicts with corresponding data points or reality e.g. a title column shows ‘Mrs.’ when the sex column states ‘male’
- Non-uniformity error – data is not present in a consistent format e.g. GBP and £
- Duplication error – where duplicate observations are present
Structured Data: Data Preparation and Wrangling
Data Wrangling (Preprocessing)
- Predominantly the transformation and scaling of data on the cleansed data set
- Common transformations used in practice include:
- Extraction – new variable extracted from a current variable e.g. Age from observed DoB
- Aggregation – consolidation of two or more similar variables into one variable e.g. capital gains/losses and income combined to give total return
- Filtration – data rows not required must be identified and filtered
- Selection – data columns not intuitively needed can be removed
- Conversion – the data (nominal, ordinal, continuous, categorical) may need to be converted in order to be processed further, e.g. removal of prefixes such as currency symbols
- Outliers need to be identified so they can be removed or replaced. Several techniques exist. Data values are flagged as outliers if they are outside of:
- 3 standard deviations from the mean, or
- 1.5 times the inter-quartile range beyond the 3rd quartile (upper bound) or 1st quartile (lower bound)
- There are several methods to deal with outliers:
- Trimming – removal of the outliers and extreme values
- Winsorization – extreme values and outliers are replaced with the maximum (for large outliers) and minimum (for small outliers) values of data points that are not deemed to be outliers
- Scaling – the process of adjusting the range of a feature by shifting and changing the scale of the data
- Required for ML techniques requiring scaled data, e.g. a neural network
- Two common methods (formulas below): Normalization (sensitive to outliers) & Standardization (assumes normal distribution & less sensitive to outliers)
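The two scaling formulas in symbols:

```latex
X_{norm} = \frac{X_i - X_{min}}{X_{max} - X_{min}},\qquad
X_{std} = \frac{X_i - \mu}{\sigma}
```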
Unstructured: Data Preparation (Cleansing)
Basic operations in the text cleansing process includes removing:
- HTML tags - Required if the text is obtained from website
- Punctuation and numbers - Generally they are removed, as the words in the sentence convey the meaning, e.g. the presence of the word “boosted” in an earnings press release may indicate positive sentiment (rather than the number itself) - However, sometimes they can be useful e.g. the % sign (which would be replaced with the annotation /PercentSign/ to preserve its grammatical meaning in the text)
- White spaces - Removal of unnecessary white spaces that might have occurred because of the removal of punctuations and numbers
Unstructured: Text Wrangling (Preprocessing)
- Involves the process of tokenization: Process of splitting text into separate tokens (e.g. words) - Can be done at a character or word level (most common)
- The normalization process involves the following:
- Lowercasing • Removes the distinction among the same words e.g. “It” and “it”
- Stop words • Such as “is”, “the” and “a” don’t always carry a semantic meaning so they are often removed at this stage (or maybe later in the data exploration stage because of high word frequency)
- Stemming • Converting inflected forms of a word into a base word e.g. “fishing“, “fished“, and “fisher“ to the stem “fish”
- Lemmatization • Converting inflected forms of a word into its morphological root known as a lemma • Requires an understanding of the relevant dictionary and is more expensive and advanced
Creating Bag-of-Words (BOW)
Unstructured: Text Wrangling (Preprocessing)
A procedure used to analyze text: a bag-of-words is the collection of distinct tokens observed across all the texts in a sample data set
The final BOW created after normalization can be viewed in a document term matrix (DTM) which makes the text more structured
An N-grams technique can be used to attach words together to show representation of word sequences e.g. a bigram such as “not_present”. This ensures the term “not” isn’t considered a single token that may have been removed during normalization.
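A toy bag-of-words / document term matrix sketch (scikit-learn assumed; the sentences are made up), including bigrams so that “not” attaches to the following word:

```python
# Sketch only; two invented sentences to show the DTM structure.
from sklearn.feature_extraction.text import CountVectorizer

texts = ["profits boosted by strong sales",
         "profits not boosted this quarter"]

vec = CountVectorizer(ngram_range=(1, 2))   # single tokens and two-word sequences
dtm = vec.fit_transform(texts)              # document term matrix (sparse)

print(vec.get_feature_names_out())          # the bag of words, incl. bigrams like "not boosted"
print(dtm.toarray())                        # token counts per document
```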
Data Exploration Summary
Involves three vital tasks:
- Exploratory Data Analysis (EDA): This preliminary step of data exploration involves the creation of graphs, charts, heat maps and word clouds. EDA helps stakeholders connect and ensure the prepared data is sensible. EDA also allows for inspection of simple questions and hypotheses which enables planning for the next stage
- Feature Selection: Where only the key features from the dataset are selected for ML model training
- Feature Engineering: Process of creating new features by changing or transforming existing features
2 & 3 heavily influence model performance!
Structured Data: Data Exploration
- Exploratory Data Analysis (EDA)
- Principal Components Analysis (PCA) can be used on high-dimension data
- Exploratory visualization for one-dimensional data (bar charts etc)
- Exploratory visualization for two-dimensional data includes scatterplots etc
- Feature Selection
- Removal of unneeded, irrelevant, and redundant features to achieve model parsimony
- Basic diagnostic tests are carried out to identify: - Feature redundancy - Heteroskedasticity - Multicollinearity
- Dimension reduction is carried out which creates new combinations of features that are uncorrelated which helps to reduce cost and increase processing speeds
- Feature Engineering
- This process helps to further optimize and improve the features e.g. categorizing ages into either retirement and non-retirement age features
- For categorical data it may involve one hot encoding where a categorical feature is converted to a binary outcome of 0 or 1, e.g. is_RetirementAge assigned “0” for false, and “1” for true
Unstructured Data: Exploratory Data Analysis (EDA)
- Most common text analytical procedures are:
- Text classification – supervised ML to classify texts into different classes
- Topic modelling – unsupervised ML that groups texts into topic clusters
- Fraud detection
- Sentiment analysis – both supervised and unsupervised ML to predict the sentiment of texts
- Statistical measures used as part of EDA on text data:
- Term (or Collection) Frequency (TF) = No. of times a given token occurs in all texts/total number of tokens, and allows the analyst to identify (and potentially remove) noisy terms
- Word associations
- Average sentence and word length
- Word and syllable count
- Word clouds are a common visual technique used
Unstructured Data: Feature Selection
- For text data this involves selection of a subset of tokens occurring in the dataset, these represent features of the data set.
- Noisy features represent the most infrequent and most frequent tokens in the dataset (e.g. stop words). Identification and removing this noise is an important task
- General feature selection methods include:
- Frequency measures
- Chi-square test: • Used to test the independence of two events e.g. occurrence of the token vs. the occurrence of the class • Useful for ranking – tokens with the highest test statistic occur more frequently in texts associated with a particular class and may be selected as a feature
- Mutual information (MI): Measures how much information is contributed by a token to a class of texts • Value of “0” if the token appears equally in all classes, or “1” if it occurs in only one class of text
Unstructured Data: Feature Engineering
This process is similar to techniques used for structured data
Techniques used include:
- Numbers: Numbers of a certain length could be identified as a particular token, e.g. a 5-digit number representing a telephone area code in the UK. A feature labelled /number5/ could be created to represent the token
- N-grams
- Name Entity Recognition (NER) and Parts of Speech (POS) • Algorithms used to analyze individual tokens and their surrounding semantics whilst referencing to a dictionary in order to tag an object class to the token, e.g. taking a sentence and attaching labels such as verb, noun, percent, time, money etc.
Model Training
Three vital tasks
- Method Selection: Deciding which ML method(s) to use (ML section)
- Performance Evaluation: Techniques and measures used to quantify and understand the model performance
- Tuning: Decisions and actions to improve the model performance
Iterative process: Repeated many times until the desired level of model performance is attained
Model Training: Model Selection
- Factors to consider when selecting the ML method or algorithm to be used include:
- Supervised or Unsupervised
- Type of Data
- Size of Data
- Once the method is selected, certain method-related decisions need to be made, i.e. hyper-parameters e.g. number of hidden layers in a neural network
- Data needs to be split before training begins:
- In-sample data: Training sample (60%)
- Out-of-sample data: - Validation sample - Testing sample (40%)
Model Training: Performance Evaluation
The process of measuring the ‘Goodness of Fit’ of the ML model - Several techniques are used and we will discuss methods suited to binary classification models
- Error Analysis - The computation of four basic evaluation metrics (formulas follow the list below)
A confusion matrix is used to summarize the underlying outcomes:
- True positive (TP)
- False positive (FP) - Type I error
- True negative (TN)
- False negative (FN) - Type II error
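The four metrics usually computed from these counts (stated here for reference):

```latex
Precision = \frac{TP}{TP+FP},\quad Recall = \frac{TP}{TP+FN},\quad
Accuracy = \frac{TP+TN}{TP+FP+TN+FN},\quad F1 = \frac{2\cdot Precision\cdot Recall}{Precision + Recall}
```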
- Receiver Operating Characteristic (ROC) - Assesses model performance by plotting a curve that represents the trade-off between the false positive rate and the true positive rate for various cutoff points (for the observation to be classified as either “0” or “1”)
- False Positive rate = FP / (TN + FP)
- True Positive rate (Recall) = TP / (TP + FN)
- Root Mean Squared Error (RMSE)
- Appropriate for continuous data predictions and is commonly used in regression
- A single metric capturing all the prediction errors in the data (n)
- Square root of mean of the squared differences between actual values and the model’s predicted values
Model Training: Tuning
- Once the model has been evaluated, based on the findings, the performance of the model needs to be improved:
- High prediction error on the training set = Underfit
- Prediction error on the cross-validation (CV) set is much higher than on the training set = Overfit
- Two types of error in model fitting:
- Bias error: - Model is overly simplified and does not learn adequately from the training data - Associated with underfitting
- Variance error: - Model is overly complicated and starts to memorize the training data and therefore performs poorly on new data - Associated with overfitting
- It is not possible to remove both, however, it is possible to minimize the total aggregate error (bias and variance error)
- Hyperparameters must be chosen in advance, e.g. regularization term (λ) in a supervised model, number of hidden layers in a NN
- Grid search is a method of systematically training a ML model by using various combinations of hyperparameters and choosing the one with best model performance
- Results can be analyzed using a fitting curve
Tuning: Fitting Curve
- Very low regularization:
- Prediction error on the training set is small (memorizing the data) but high on the cross validation data set
- High variance error and low bias error
- Model is overfitted as it does not perform well on new data
- Very high regularization:
- Too few features included so the model is unable to learn
- High prediction error on both the training (suggesting high bias) and CV datasets
- Suggests model underfitting
- Optimum regularization:
- Minimizes total error in a balanced fashion, with prediction error in the training and CV datasets that are similar