Quantitative methods Flashcards
Durbin-Watson tests for?
Value 0 =
Value 2 =
DW lower =
DW middle =
DW upper =
serial correlation
Value 0 = perfect positive serial correlation (4 = perfect negative)
Value 2 = no serial correlation
DW below lower bound = reject the null of no serial correlation
DW between the bounds = inconclusive
DW above upper bound = do not reject the null
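The DW statistic can be sketched in a few lines of numpy; the residual series below are hypothetical, purely to show the 0 / 2 / 4 behaviour.

```python
import numpy as np

def durbin_watson(resid):
    """DW = sum of squared successive differences of the residuals,
    divided by the sum of squared residuals. Ranges from 0 (perfect
    positive serial correlation) through 2 (none) to 4 (perfect negative)."""
    resid = np.asarray(resid, dtype=float)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

# Uncorrelated (white-noise) residuals give a statistic near 2.
rng = np.random.default_rng(0)
dw = durbin_watson(rng.standard_normal(1000))
```

Constant residuals (perfect positive correlation) give 0, and a long alternating series approaches 4.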
T stat =
t stat = (Coefficient - hypothesised value) / SE
The hypothesised value is often zero
R^2 coefficient of determination =
Adjusted R^2 =
Correlation coefficient =
SST =
MSE =
SEE =
R^2 = RSS / SST
Adjusted R^2 = 1 - [(n-1)/(n-k-1)] x (1 - R^2), lower than R^2
Correlation coefficient = square root of R^2 (simple regression only)
SST = SSE + RSS
Total variation = SSE (unexplained) + RSS (explained)
MSE = SSE / (n-k-1)
SEE = square root of MSE
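The whole decomposition can be checked numerically. A minimal sketch with numpy, assuming a hypothetical simple regression (k = 1) fitted by OLS so that SST = SSE + RSS holds exactly:

```python
import numpy as np

# Hypothetical data: simple regression of y on x with an intercept (k = 1).
x = np.arange(6.0)
y = np.array([3.1, 4.9, 7.2, 8.8, 11.3, 12.7])
n, k = len(y), 1

b1, b0 = np.polyfit(x, y, 1)           # OLS slope and intercept
y_hat = b0 + b1 * x

sse = np.sum((y - y_hat) ** 2)         # unexplained variation
rss = np.sum((y_hat - y.mean()) ** 2)  # explained variation
sst = np.sum((y - y.mean()) ** 2)      # total variation; SST = SSE + RSS for OLS

r2 = rss / sst
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
see = np.sqrt(sse / (n - k - 1))       # SEE = sqrt(MSE)
```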
What is multicollinearity?
Signs of multicollinearity
Problems caused by it?
Correcting multicollinearity?
What is multicollinearity?
Two or more independent variables are highly correlated.
Signs of multicollinearity? High R^2 and a significant F-stat but low t-stats; detected using a correlation matrix.
Problems caused by it? Estimates of regression coefficients become unreliable.
Correcting multicollinearity? 1. Remove one or more of the correlated variables 2. Re-run the model
What is serial correlation?
Test
Correcting for serial correlation?
When the error terms of a time-series regression are correlated across periods (e.g. when past values are used to predict security price changes). Use the DW upper and lower limits to test: a DW statistic between the upper bound and 2 indicates no serial correlation.
Test = Durbin-Watson
Correcting for serial correlation? The Hansen method adjusts the standard errors of the regression coefficients (typically upwards) so that inference remains valid.
Which model assigns a 1 or 0 to the value of the dependent variable?
Discriminant analysis models
Hansen method adjusts what?
Adjusts standard errors for both conditional heteroskedasticity and serial correlation
What do probit models test for?
How do they estimate the value of the dependant variable?
What sort of variables can probit test?
Probit models are based on the normal distribution.
They estimate the probability that the dependent variable equals 1.
They test qualitative (binary) dependent variables.
What is homoskedasticity?
What is conditional heteroskedasticity?
What is unconditional heteroskedasticity?
Variance of the error term is constant across all observations
Variance of the error terms changes in a systematic manner that is correlated with the values of the independent variables
Variance of the error term changes in an unsystematic way that is not correlated with the independent variables.
Dickey-Fuller tests for?
Durbin-Watson tests for?
Breusch-Pagan tests for?
Dickey-Fuller tests for non-stationarity (a unit root)
Durbin-Watson tests for serial correlation
Breusch-Pagan tests for conditional heteroskedasticity using a chi-squared statistic.
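The Breusch-Pagan statistic is just n x R^2 from regressing the squared residuals on the independent variables. A hand-rolled sketch with numpy; the data, seed, and heteroskedasticity pattern are hypothetical:

```python
import numpy as np

def breusch_pagan(x, resid):
    """BP statistic = n * R^2 from regressing squared residuals on x.
    Compare against a chi-squared critical value with k degrees of freedom."""
    n = len(resid)
    Z = np.column_stack([np.ones(n), x])          # add an intercept
    e2 = np.asarray(resid, dtype=float) ** 2
    coef, *_ = np.linalg.lstsq(Z, e2, rcond=None)
    fitted = Z @ coef
    r2 = np.sum((fitted - e2.mean()) ** 2) / np.sum((e2 - e2.mean()) ** 2)
    return n * r2

# Hypothetical illustration: error variance that rises with x.
rng = np.random.default_rng(1)
x = rng.standard_normal(200)
scale = x - x.min() + 0.5                          # increases with x
homo_bp = breusch_pagan(x, rng.standard_normal(200))
hetero_bp = breusch_pagan(x, rng.standard_normal(200) * scale)
```

With one regressor the 5% chi-squared critical value is 3.84; the heteroskedastic series should comfortably exceed it.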
Y = b0 + b1X1 + b2X2 + error term
b0 =
b1X1 =
b2X2 =
Y = what you are forecasting (dependent variable)
b0 = intercept
b1X1 = regression coefficient x first independent variable
b2X2 = regression coefficient x second independent variable
T-stat =
(Coefficient - hypothesised value) / SE
The coefficient could be on advertising or hours worked, for example
F Stat =
F Table =
F stat > F table
F Stat = (RSS/k) / (SSE/(n-k-1)) = MSR / MSE
F Table = critical value with k and n-k-1 degrees of freedom
F stat > F table means at least one independent variable significantly explains variance in the dependent variable
k = number of independent variables (e.g. advertising and hours worked, k = 2); n = number of observations, usually years
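A quick worked example of the F-stat formula, with hypothetical sums of squares and sample sizes:

```python
# Hypothetical: k = 2 independent variables, n = 20 observations.
n, k = 20, 2
rss, sse = 80.0, 40.0                     # explained and unexplained variation

f_stat = (rss / k) / (sse / (n - k - 1))  # MSR / MSE
# Compare f_stat against the F-table critical value with (k, n-k-1)
# degrees of freedom; here that is (2, 17).
```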
How many dummy variables needed for 4 quarters?
The number of dummy variables is ONE less than the number of categories. So 3.
Mean reverting level =
MRL = b0 / (1 - b1)
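A one-line worked example, using hypothetical AR(1) coefficients:

```python
# Hypothetical AR(1): x_t = 0.6 + 0.7 * x_{t-1} + e_t
b0, b1 = 0.6, 0.7
mrl = b0 / (1 - b1)  # mean-reverting level: 0.6 / 0.3 = 2.0
```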
AR Model with 101 observations, SE =
What test is used and what level shows AR model is not correctly specified? =
Less than this level shows?
SE = 1 / square root of n = 1 / square root of 100 = 0.1
The t-distribution is used at the 5% level; a residual autocorrelation with a t-stat above about 2 means the model is mis-specified.
A t-stat below this level shows the model is specified correctly.
Random walk =
Random walk unit root lag coefficient b1 =
Mean Reverting =
Random walk = test statistic not significant (cannot reject the unit root)
lag coefficient b1 = 1
Mean reverting = test statistic significant (unit root rejected), |b1| < 1
ARCH Errors =
The error terms are heteroskedastic and the SEs of the regression coefficients are incorrect.
Signs include the coefficient on the lagged squared residual being significantly different from zero in the model.
Dickey-Fuller
a) has a problem with the unit root
b) does not have a problem with the unit root
Dickey-Fuller
a) has a problem with the unit root = fail to reject the null
b) does not have a problem with the unit root = reject the null; the series are cointegrated.
First differenced random walk =
Yt = b0 + error term (b0 = 0 for a random walk without drift)
Where Yt = Xt - Xt-1
A lag coefficient > 1 concludes
Lag coefficient > 1 = The model has an explosive root.
First differencing =
Transforms a random walk into a covariance-stationary series; if the differenced series is stationary with b0 = b1 = 0, the analyst can conclude the original time series is a random walk.
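First differencing is a one-liner in numpy. A sketch using a hypothetical simulated random walk:

```python
import numpy as np

# Hypothetical random walk x_t = x_{t-1} + e_t; its first difference is just e_t.
rng = np.random.default_rng(0)
e = rng.standard_normal(500)
x = np.cumsum(e)   # random walk (non-stationary)
y = np.diff(x)     # first-differenced series, y_t = x_t - x_{t-1} (stationary)
```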
Dendrogram hierarchical clustering with short dendrites indicates?
Dendrites are the vertical lines, and shorter lines indicate more similar clusters of data.
Supervised learning
vs
Unsupervised learning
Supervised learning uses pre-labelled data, such as transactions labelled as fraudulent
vs
Unsupervised learning does not use pre-labelled data and algorithms try to describe the data.
Bagging (bootstrap aggregating) samples original data or old data? What type of samples are used?
Bagging uses the original data and reduces the incidence of overfitting. New data bags are produced by random sampling with replacement.
What is divisive hierarchical clustering?
Supervised or unsupervised?
Begins with one cluster divided into smaller clusters.
Top down process until each cluster has only one observation
Unsupervised learning
What is dimension reduction?
Supervised or Unsupervised?
Identifying major correlated data factors and reducing them into fewer uncorrelated variables, a form of unsupervised learning.
Penalised regression =
Adds a penalty term that increases with the number of included variables (non-zero coefficients), e.g. LASSO.
What is a classification and Regression Tree? CART
What does it minimise?
Supervised or Unsupervised?
Splitting data into two categories using decision trees and binary branching to classify observations. It makes no assumptions about data sets.
They are used to minimise classification errors.
CART is a form of supervised machine learning
Ensemble learning
Ensemble learning results in more accurate and more stable models. Ensemble learning can aggregate both heterogeneous and homogeneous learners.
Base error
Bias error
Variance error
Base error arises from randomness in the data
Bias error arises when a model does not fit the training data well (underfitting)
Variance error arises when the model fits the training data too well, picking up noise (overfitting)
What is overfitting?
What is underfitting?
Which is associated with linear and which with non-linear functions?
When a machine learning model learns the input and target data set too precisely. Overfitting is a non-linear function error.
Underfitting is the opposite and is susceptible to linear function errors.
Centroids k-means clustering?
What does it require?
Centroids k-means clustering is when the algorithm iterates until no observations are moved to new clusters.
It requires a pre-defined number of clusters, 'k', chosen before running the algorithm.
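The iterate-until-no-observation-moves idea can be sketched in numpy; the data points and seed below are hypothetical:

```python
import numpy as np

def k_means(data, k, iters=100, seed=0):
    """Minimal k-means sketch: assign points to the nearest centroid,
    recompute centroids, stop when no observation changes cluster."""
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), size=k, replace=False)].astype(float)
    labels = np.full(len(data), -1)
    for _ in range(iters):
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                          # no observation moved: converged
        labels = new_labels
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = data[labels == j].mean(axis=0)
    return labels, centroids

# Two well-separated 2-D clusters (hypothetical data).
pts = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])
labels, centroids = k_means(pts, k=2)
```

Note that k is passed in up front: the algorithm cannot choose the number of clusters itself.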
What is an eigenvalue?
The proportion of total variance explained by an eigenvector from the initial data.
Agglomerative clustering =
Divisive clustering =
How many observations in the final cluster for each?
Agglomerative clustering is bottom up: it starts with each observation as its own cluster and merges them, so the final cluster contains all items/observations. Clusters increase in size.
Divisive clustering = top down; the final clusters each contain one observation
k-fold cross validation gives an estimate of?
k-fold cross validation gives an estimate of 'out-of-sample' error
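The fold-splitting step can be sketched with numpy; the sample size, fold count, and function name are illustrative assumptions:

```python
import numpy as np

def k_fold_indices(n, k, seed=0):
    """Shuffle n sample indices and split them into k roughly equal folds.
    Each fold serves once as the validation set; training on the remaining
    folds and averaging the k validation errors estimates out-of-sample error."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    return np.array_split(idx, k)

folds = k_fold_indices(10, 5)  # 5 folds of 2 indices each
```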
Random forest classifier?
A collection of classification trees whose outputs are combined, reducing the problem of overfitting.
ML Regression problem target is:
ML classification problem target data is:
ML Regression problem target is continuous
ML classification problem target data is either categorical or ordinal
Which neural layer does learning take place?
Which element of a neural network increases/decreases the strength of an input?
Hidden layer
The activation function increases/decreases strength of input.
Deep learning nets aka:
supervised or unsupervised?
What separates them?
Deep learning nets aka Artificial Neural Networks
BOTH supervised and unsupervised!
What separates them is the many hidden layers (at least three)
Activation of a neural network =
A non-linear function which adjusts the strength of an input.
Number of nodes in a Deep Learning Net DLN is determined how?
The number of nodes is determined by the number of dimensions in a feature set.
Training a neural network is forward or backward?
Both forward propagation and backward propagation (backpropagation) are used to train a neural network.
Different between reinforcement and supervised learning?
Reinforcement learning neither uses labelled data nor gives instantaneous feedback; it learns from delayed rewards.
Data curation stage includes
Data exploration includes
Data curation includes web spidering, which gathers raw data.
Data exploration includes feature selection (choosing which features to keep) and feature engineering (optimising the selected features).
Mutual information: a token such as 'dollar' appearing in all classes of text is assigned a:
A token appearing in only one class of text is assigned a:
Mutual information: a token appearing in all classes of text is assigned a value of 0
A token appearing in only one class of text is assigned a value of 1
Stages of data exploration =
- Exploratory data analysis (EDA)
- Feature selection
- Feature engineering
Structured or Unstructured for:
Standard ML models
Text ML models
Standard ML models = Structured data
Text ML models = Unstructured data
ML Iterative process
Step 1
Step 2
ML Iterative process
Step 1 = Conceptualization
Step 2 = Reconceptualization
4 V’s of big data =
Volume (quantity of data)
Variety (array of data)
Velocity (speed of data creation)
Veracity (reliability/credibility of data)
Trimming
vs
Filtration
vs
Winsorisation
Trimming removes outliers at both extremes
vs
Filtration removes unrequired data
vs
Winsorisation is a data-wrangling step (preparing data for the ML model) where high and low outliers are replaced with the nearest retained values.
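The difference is easy to see in numpy. A sketch using hypothetical data and hypothetical 10th/90th percentile cut-offs:

```python
import numpy as np

# Hypothetical data with one low and one high outlier.
data = np.array([-50.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 90.0])

lo, hi = np.percentile(data, [10, 90])
trimmed = data[(data >= lo) & (data <= hi)]  # trimming: drop the outliers
winsorised = np.clip(data, lo, hi)           # winsorisation: replace them
```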
Precision of data formula =
Recall data formula =
Accuracy formula =
FP =
FN =
Given True Positive TP, False Positive FP, False Negative FN, and True Negative TN.
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
Accuracy = (TP + TN) / (TP + FP + TN + FN)
FP = Type I error
FN = Type II error
Area Under Curve (AUC) value indicating random guessing:
AUC showing predictive ability and higher convexity =
Area Under Curve value for random guessing: 0.5
AUC showing predictive ability and higher convexity = above 0.5, e.g. 0.67
3 stages of model training =
Small data set more likely of:
3 stages of model training =
- Method selection
- Performance evaluation
- Tuning
Underfitting more likely from small data sets
Main advantage of a simulation model over a decision tree?
3 data types of simulation data =
Simulations provide a FULL distribution of outcomes in addition to expected values.
3 data types of simulation data =
1 Historical data
2. Cross sectional data
3. Adopting a statistical distribution.
Steps in simulations
What is a probabilistic variable?
- Determine probabilistic variables
- Define probability distributions for them
- Check for correlations across variables
A probabilistic variable is an uncertain input modelled with a probability distribution; choosing how many to include is a trade-off between the number of variables and the complexity of the simulation.
Which model below better copes with sequential risk and concurrent risk?
Simulations =
Decision trees =
Scenario analysis =
Simulations = accommodate both sequential and concurrent risk
Decision trees = better accommodate sequential risk
Scenario analysis = better accommodates concurrent risk
Random walk signs
Test for a random walk on an AR(1) model =
Slope coefficient is close to 1
An AR(1) model is tested for a random walk (unit root) using the Dickey-Fuller test.
confidence interval =
90%
95%
99%
90% = 1.65, 95% = 1.96, 99% = 2.58
Coefficient +/- critical value x SE
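A quick worked example of the interval formula, with a hypothetical coefficient and standard error:

```python
# Hypothetical regression coefficient and standard error.
coef, se = 0.50, 0.10
z = 1.96  # 95% critical value (1.65 for 90%, 2.58 for 99%)

lower, upper = coef - z * se, coef + z * se  # 95% CI: (0.304, 0.696)
```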
High bias error
High variance error
Under or over fitted?
High bias error indicates underfitting a dataset
High variance error indicates overfitting a dataset
Precision =
Recall =
Accuracy =
F1 =
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
Accuracy = (TP + TN) / (TP + TN + FP + FN)
F1 = 2 x P x R / (P + R)
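All four formulas can be packaged into one small function; the function name and confusion-matrix counts below are hypothetical:

```python
def confusion_metrics(tp, fp, fn, tn):
    """Precision, recall, accuracy, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, accuracy, f1

# Hypothetical counts: a balanced classifier.
p, r, a, f1 = confusion_metrics(tp=8, fp=2, fn=2, tn=8)
```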
What is data wrangling?
Text wrangling (preprocessing) can be essential in making sure you have the best data to work with. It requires performing normalisation, which involves the following:
- lowercasing
- removing stop words such as "the" and "a" because of their many occurrences
- stemming: cutting down a token to its root stem
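The three steps above can be sketched in plain Python. The tiny stop-word list and the strip-a-trailing-s "stemmer" are deliberately crude assumptions, only to show the pipeline shape:

```python
def normalise(text, stop_words=("the", "a", "an")):
    """Lowercase, drop stop words, then crudely stem plural tokens
    by stripping a trailing 's' (a toy stand-in for a real stemmer)."""
    tokens = [t.lower() for t in text.split()]
    tokens = [t for t in tokens if t not in stop_words]
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]

tokens = normalise("The Analysts Review the Reports")
```

A real pipeline would use a proper stemmer (e.g. Porter) rather than this toy rule.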